I chose a churn dataset: it relates telephony account features and usage characteristics to whether or not the customer churned.
In the plots below we can see that all feature variables are approximately normally distributed. That is convenient because no data transformation is needed in that case.
However, the predicted variable has a quite unbalanced distribution. For that reason, to evaluate the models appropriately, I compute f1-score and recall in addition to the standard accuracy metric.
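To see why accuracy alone is misleading on an imbalanced target, here is a minimal sketch with synthetic labels (the ~14% positive rate is an assumption for illustration, not the actual churn data): a classifier that always predicts the majority class still reaches high accuracy, while its recall and f1-score are zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic imbalanced labels: roughly 86% negative, 14% positive (illustrative ratio).
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.14).astype(int)
y_pred = np.zeros_like(y_true)  # trivial majority-class predictor

print(accuracy_score(y_true, y_pred))  # high, despite the model being useless
print(recall_score(y_true, y_pred))    # 0.0 - not a single positive is found
print(f1_score(y_true, y_pred))        # 0.0
```

Recall and f1-score expose the failure immediately, which is exactly why they are tracked below.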
The results confirm that recall and f1-score were worth tracking.
The logistic regression model achieves decent accuracy, but its recall and f1-score are terrible. The random forest performs much better and significantly outperforms the regression; it also overfits heavily, scoring nearly 100% on the train dataset in all metrics.
The best model overall, however, is TabPFN. I fitted it on data limited to 1000 rows due to an issue with RAM consumption, yet it still obtained better results on the test dataset than the random forest. The gap was not as big as between the first two models, but the score achieved by the random forest was already quite good.
!pip install -q tabpfn
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("churn.csv", index_col=0)
print(f'Dataset has {data.shape[0]} rows and {data.shape[1]} columns')
display(data.head())
Dataset has 5000 rows and 9 columns
| | total_day_minutes | total_day_charge | total_eve_minutes | total_eve_charge | total_night_minutes | total_night_charge | total_intl_minutes | total_intl_charge | TARGET |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 265.1 | 45.07 | 197.4 | 16.78 | 244.7 | 11.01 | 10.0 | 2.70 | 0 |
| 1 | 161.6 | 27.47 | 195.5 | 16.62 | 254.4 | 11.45 | 13.7 | 3.70 | 0 |
| 2 | 243.4 | 41.38 | 121.2 | 10.30 | 162.6 | 7.32 | 12.2 | 3.29 | 0 |
| 3 | 299.4 | 50.90 | 61.9 | 5.26 | 196.9 | 8.86 | 6.6 | 1.78 | 0 |
| 4 | 166.7 | 28.34 | 148.3 | 12.61 | 186.9 | 8.41 | 10.1 | 2.73 | 0 |
All feature variables are approximately normally distributed. The predicted variable, however, has a quite unbalanced distribution.
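Histograms are a visual check of normality; a quick statistical complement is D'Agostino-Pearson's `normaltest` from scipy. A sketch on synthetic data (Gaussian by construction, standing in for one of the feature columns):

```python
import numpy as np
from scipy import stats

# Synthetic sample standing in for a feature column; normal by construction.
rng = np.random.default_rng(42)
sample = rng.normal(loc=200.0, scale=50.0, size=5000)

# D'Agostino-Pearson test: a large p-value means no evidence against normality.
stat, p = stats.normaltest(sample)
print(f'statistic: {stat:.3f}, p-value: {p:.3f}')
```

Applied to each of the eight feature columns, this would back up the visual impression from the plots.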
fig, ax = plt.subplots(3, 3, figsize=(14, 10))
for i in range(8):  # one histogram per feature column
    ax[i // 3, i % 3].hist(data.iloc[:, i])
    ax[i // 3, i % 3].set_title(data.iloc[:, i].name)
fig.tight_layout()
data['TARGET'].plot(kind='hist')  # class distribution of the target
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score
x, y = data.iloc[:, :-1], data.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
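With an imbalanced target it can be safer to stratify the split so that train and test sets keep the same churn rate (the split above is unstratified). A sketch with synthetic labels (the ~14% positive rate is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels standing in for y: 1000 rows, 14% positives.
y_demo = np.array([0] * 860 + [1] * 140)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo preserves the class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())  # both close to 0.14
```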
def compute_metrics(pred_train, pred_test):
    # Train predictions may cover only a prefix of y_train (e.g. TabPFN below is
    # fitted on 1000 rows), so align the labels with the number of predictions.
    y_train_part = y_train[:pred_train.shape[0]]
    train_acc = accuracy_score(y_train_part, pred_train)
    train_f1 = f1_score(y_train_part, pred_train)
    train_recall = recall_score(y_train_part, pred_train)
    train_roc_auc = roc_auc_score(y_train_part, pred_train)
    test_acc = accuracy_score(y_test, pred_test)
    test_f1 = f1_score(y_test, pred_test)
    test_recall = recall_score(y_test, pred_test)
    test_roc_auc = roc_auc_score(y_test, pred_test)
    display(pd.DataFrame(
        {
            'Train': [train_acc, train_f1, train_recall, train_roc_auc],
            'Test': [test_acc, test_f1, test_recall, test_roc_auc]
        },
        index=['Accuracy', 'f1-score', 'Recall', 'ROC AUC']
    ))
from sklearn.linear_model import LogisticRegression
logisticRegression = LogisticRegression().fit(x_train, y_train)
compute_metrics(logisticRegression.predict(x_train), logisticRegression.predict(x_test))
| | Train | Test |
|---|---|---|
| Accuracy | 0.860500 | 0.865000 |
| f1-score | 0.037931 | 0.055944 |
| Recall | 0.019366 | 0.028777 |
| ROC AUC | 0.509537 | 0.514388 |
from sklearn.ensemble import RandomForestClassifier
randomForest = RandomForestClassifier().fit(x_train, y_train)
compute_metrics(randomForest.predict(x_train), randomForest.predict(x_test))
| | Train | Test |
|---|---|---|
| Accuracy | 0.999750 | 0.889000 |
| f1-score | 0.999119 | 0.468900 |
| Recall | 0.998239 | 0.352518 |
| ROC AUC | 0.999120 | 0.664064 |
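The train/test gap above is the signature of overfitting: unconstrained trees can memorize the training set. A hedged sketch of the usual remedies on synthetic data (the `max_depth` and `min_samples_leaf` values are illustrative, not tuned for the churn data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced classification task standing in for the churn data.
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.86],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Unconstrained forest: typically near-perfect on train.
rf_full = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# Constrained forest: shallower trees, larger leaves -> less memorization.
rf_reg = RandomForestClassifier(max_depth=6, min_samples_leaf=10,
                                random_state=0).fit(X_tr, y_tr)

print('train acc (full):', rf_full.score(X_tr, y_tr))
print('train acc (reg): ', rf_reg.score(X_tr, y_tr))
```

Whether the constrained version also improves the test-set metrics would have to be checked with cross-validation; the point here is only that the train score drops back toward realistic levels.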
from tabpfn import TabPFNClassifier
train_size = 1000  # Use only part of the data; fitting on the full dataset runs into RAM consumption issues.
tabPFN = TabPFNClassifier(device='cpu', N_ensemble_configurations=10).fit(x_train[:train_size], y_train[:train_size])
compute_metrics(tabPFN.predict(x_train[:train_size]), tabPFN.predict(x_test))
Loading model that can be used for inference only
Using a Transformer with 25.82 M parameters
| | Train | Test |
|---|---|---|
| Accuracy | 0.907000 | 0.895000 |
| f1-score | 0.532663 | 0.482759 |
| Recall | 0.395522 | 0.352518 |
| ROC AUC | 0.690833 | 0.667548 |
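One caveat about the 1000-row limit: taking the first 1000 rows of `x_train` means the subsample's class ratio depends on row order. A stratified subsample avoids that; a sketch using `train_test_split` purely as a sampler, on synthetic labels standing in for `y_train` (the ~14% positive rate is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training labels: 4000 rows, 14% positives.
y_full = np.array([0] * 3440 + [1] * 560)
x_full = np.arange(len(y_full)).reshape(-1, 1)

# Draw a stratified 1000-row subsample instead of taking the first 1000 rows.
x_sub, _, y_sub, _ = train_test_split(
    x_full, y_full, train_size=1000, random_state=42, stratify=y_full
)
print(len(y_sub), y_sub.mean())  # 1000 rows, class ratio ~0.14 preserved
```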